Source: Chollet et al., Deep Learning With R
source('src/lib.R')
Intuitive representation of how a cat face is learnt
Besides the dense (i.e. fully connected) layer we already saw with multilayer perceptrons, there are two main structures we need to master to build a convolutional neural network.
The first is the convolutional layer, hence the name convnets. What it does is create many small fully connected units (called - we are sorry - kernels) that sweep along the input. Being so small, they cannot learn anything but local patterns.
3x3 convolution
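To make the idea concrete, here is a minimal base-R sketch (made-up values, not how keras actually implements it) of what a single 3x3 kernel does as it sweeps along a 5x5 input:
# toy 5x5 input (left columns dark, right columns bright) and a 3x3 kernel sensitive to vertical edges
input  = matrix(c(rep(0, 10), rep(1, 15)), nrow = 5)
kernel = matrix(c(-1, -1, -1, 0, 0, 0, 1, 1, 1), nrow = 3)   # columns of weights: -1, 0, +1
# slide the kernel over every 3x3 patch and take the dot product of weights and patch
feature_map = matrix(0, 3, 3)
for (i in 1:3) {
  for (j in 1:3) {
    patch = input[i:(i + 2), j:(j + 2)]
    feature_map[i, j] = sum(kernel * patch)   # one value of the resulting feature map
  }
}
feature_map   # large values where the local pattern (a vertical edge) is present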
lsp(keras, pattern = 'layer_conv')
# let us instantiate the first convolutional layer of the network for the MNIST example
model_cnn = keras_model_sequential() %>%
  layer_conv_2d(input_shape = c(28, 28, 1),
                kernel_size = c(3, 3),
                filters = 32,
                activation = "relu",
                padding = 'valid',
                strides = 1)
note that:
$\texttt{filters}$ indicates how many kernels we are setting up for this layer. Please note that ideally, each kernel will learn a different feature of the input, but every kernel will sweep along the entire input.
$\texttt{padding}$ specifies whether the input is padded so that the kernel can overlap its boundaries. This is usually done when one wants the output feature map to have the same size as the input, which is what $\texttt{padding = 'same'}$ does. The default is no padding ($\texttt{padding = 'valid'}$).
3x3 convolution, with padding = same
3x3 convolution, with strides = 2
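As a rule of thumb, a convolution with kernel size $k$, stride $s$ and padding $p$ maps an input of spatial size $n$ to an output of size $\lfloor (n + 2p - k)/s \rfloor + 1$. A small helper (just a sketch, not a keras function) makes the effect of $\texttt{padding}$ and $\texttt{strides}$ explicit:
# spatial output size of a convolution (sketch; pad = 0 corresponds to padding = 'valid')
conv_out_size = function(n, k, stride = 1, pad = 0) floor((n + 2 * pad - k) / stride) + 1

conv_out_size(28, 3)                        # no padding, stride 1 -> 26
conv_out_size(28, 3, pad = 1)               # 'same'-style padding for a 3x3 kernel -> 28
conv_out_size(28, 3, stride = 2, pad = 1)   # with strides = 2 -> 14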
Each kernel outputs the dot product of its (learned) weights with the patch of input it is currently covering. The result is a filtered representation of the input, called a feature map. Each feature map should emphasize a particular aspect of the input.
from 3x3 convolution to feature map
A different number of feature maps will be produced according to how many kernels have been specified by the $\texttt{filters}$ parameter.
1 input image
32 feature maps
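As a sanity check on the layer we just instantiated: with a 28x28 input, a 3x3 kernel, no padding and stride 1, each of the 32 kernels produces a 26x26 feature map.
# shape of the first convolutional layer's output: (batch, 26, 26, 32)
# 26 = 28 - 3 + 1 (padding = 'valid'); 32 feature maps, one per filter
model_cnn$output_shape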
To reduce the dimensionality of the problem we apply a pooling layer after each convolutional layer: pooling layers are nothing more than hard-coded 2x2 convolutions with strides = 2 that downsample the feature maps (usually by taking the maximum or the average of each window).
Applying a pooling layer allows us to reduce the number of coefficients to process (and hence the number of parameters in the final dense layers) and lets successive convolutional layers look at increasingly large portions of the original input.
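Here is a minimal base-R sketch (toy values) of what 2x2 max pooling with strides = 2 does to a feature map:
# toy 4x4 feature map: 2x2 max pooling with stride 2 keeps the strongest response in each window
fmap = matrix(1:16, nrow = 4)
pooled = matrix(0, 2, 2)
for (i in 1:2) {
  for (j in 1:2) {
    window = fmap[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)]
    pooled[i, j] = max(window)   # use mean(window) for average pooling
  }
}
pooled   # each spatial dimension is halved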
lsp(keras, pattern = 'pooling')
model_cnn = model_cnn %>% layer_max_pooling_2d(pool_size = c(2, 2),
                                               padding = 'valid')
We can now finish instantiating the sequence of convolutional and pooling layers:
(model_cnn = model_cnn %>%
   layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
   layer_max_pooling_2d(pool_size = c(2, 2)) %>%
   layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu"))
At this point we have obtained a very detailed feature representation of the input: the output of the last convolutional layer contains 64 feature maps of size 3x3! It is time to use these features to classify the input!! How? Simple! With an MLP! (Early image classification models used a Support Vector Machine instead of an MLP, though SVMs are not easily extensible to more than two classes.)
(model_cnn = model_cnn %>%
   layer_flatten() %>%
   layer_dense(units = 64, activation = "relu") %>%
   layer_dense(units = 10, activation = "softmax"))
where $\texttt{layer\_flatten()}$ unrolls the 64 3x3 feature maps into a single vector of 3 x 3 x 64 = 576 values, and the final softmax layer outputs one probability per digit class (10 in total).
mnist = dataset_mnist()
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist
## side note: "%<-%" is the Multiple Assignment Operator
## it comes from the zeallot package (re-exported by keras) and assigns multiple values to multiple variables at once
## the right-hand side must be a list (here, a nested named list)
## indeed, in our case, the mnist dataset is a list with the following structure:
## mnist
## . $train
## .. $x
## .. $y
## . $test
## .. $x
## .. $y
## here, the "%<-%" assign the contents of the mnis list object to 4 independent variables
## the benefit of "%<-%" is that it is possible to do it in a single step, instead of four
## yes it's practically useless :) but the Multiple Assignment approach comes from Python
## (and Keras/TF are "Pythonic" packages)
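## a toy illustration (made-up values): c(a, b) %<-% list(1, 2) assigns a = 1 and b = 2

# reshape to (samples, height, width, channels) and rescale pixel values from [0, 255] to [0, 1]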
train_images = array_reshape(train_images, c(60000, 28, 28, 1))
train_images = train_images / 255
test_images = array_reshape(test_images, c(10000, 28, 28, 1))
test_images = test_images / 255
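# one-hot encode the digit labels (10 classes)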
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
model_cnn %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
model_cnn %>% fit(
  train_images,
  train_labels,
  epochs = 5,
  batch_size = 64
)
save_model_hdf5(model_cnn, 'data/mnist_cnn.h5')
save_model_weights_hdf5(model_cnn, 'data/mnist_cnn_w.h5')
model_cnn = load_model_hdf5('data/mnist_cnn.h5')
load_model_weights_hdf5(model_cnn, 'data/mnist_cnn_w.h5')
model_cnn %>% evaluate(test_images, test_labels)
There we have it! A significant improvement with respect to the MLP! (cf. previous notebooks)
Now that we have learned how to build a CNN, let's try to understand better what happens under the hood.
We said CNNs learn a feature representation of the input. Let's break down our network and look at the intermediate layers!
model_cnn$layers %>% class # we can access the layers, stored as an R list
model_cnn$layers %>% length # there are 8 layers
model_cnn$layers
# we create a tensor function that takes an image as input and returns 3 tensors corresponding to the
# intermediate feature maps of the CNN we just trained
# the tensor function is initialized with the actual weights of the trained CNN
(activation_model = keras_model(inputs = model_cnn$input,
                                outputs = list(model_cnn$layers[[1]]$output,   # one
                                               # model_cnn$layers[[2]]$output,
                                               model_cnn$layers[[3]]$output,   # two
                                               # model_cnn$layers[[4]]$output,
                                               model_cnn$layers[[5]]$output))) # and three
batch = array_reshape(test_images[1,,,], c(1, 28, 28, 1))
activations = activation_model %>% predict(batch)
# input image
options(repr.plot.width=1, repr.plot.height=1)
par(mar = c(0.1, 0.1, 0.1, 0.1))
batch[,,,1] %>% as.raster %>% plot
# first convolutional layer (32 feature maps)
options(repr.plot.width=10, repr.plot.height=1.2)
par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow=c(2,16))
layer = 1
for(i in seq_len(32)){
activations[[layer]][,,,i] %>% range01 %>% as.raster %>% plot
}
# second convolutional layer (64 feature maps)
options(repr.plot.width=10, repr.plot.height=2.4)
par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow=c(4,16))
layer = 2
for(i in seq_len(64)){
activations[[layer]][,,,i] %>% range01 %>% as.raster %>% plot
}
# third convolutional layer (64 feature maps)
options(repr.plot.width=10, repr.plot.height=2.4)
par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow=c(4,16))
layer = 3
for(i in seq_len(64)){
activations[[layer]][,,,i] %>% range01 %>% as.raster %>% plot
}
All in all, it is possible to see that the network progressively decomposes the input into simpler feature maps.
It is not straightforward, but - given a convolutional filter (i.e. a kernel) - we can iteratively build the input that maximizes the response (i.e. the kernel's dot product) to that chosen filter: this is gradient ascent in input space. Such an input takes the shape the chosen filter is most sensitive to.
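Concretely, starting from a random image $x$, the loop below repeatedly applies the gradient-ascent update $x \leftarrow x + \eta \, \nabla_{x} \bar{a}_k(x)$, where $\bar{a}_k(x)$ is the mean activation of the chosen filter $k$ and $\eta$ is a step size (equal to 1 here, since the gradient is normalized).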
model_cnn
# backend() gives access to some low-level TensorFlow operations
K <- backend()
# we select a layer
layer_name <- "conv2d_1"
(layer_output = get_layer(model_cnn, layer_name)$output)
# we select one among the n convolutional kernels of the chosen layer
filter_index = 2
kernel = layer_output[,,,filter_index]
#we set the loss function to be maximized as the mean of the output of that filter
(loss = K$mean(kernel))
# we compute the gradient of the loss function we created right above with respect to the input
# (i.e. a tensor of partial derivatives)
# remember: we want to know how the input must change to maximize the output (high-school calculus!)
(grads = K$gradients(loss, model_cnn$input)[[1]])
# we normalize the gradient by its L2 norm (i.e. the square root of the mean of the squares)
# in order to prevent exploding gradients
# (and we add a small number just in case grads = 0, so we don't accidentally divide by zero)
grads = grads / (K$sqrt(K$mean(K$square(grads))) + 1e-5)
# this function simply takes an image as input and returns the loss and the gradient we specified above!
(iterate = K$`function`(inputs = c(model_cnn$input),
                        outputs = c(loss, grads)))
c(loss_value, grads_value) %<-% iterate(list(array(0, dim = c(1, 28, 28, 1)))) # multiple assignment operator (see explanation above!)
# create a white noise image
input_img_data = array(runif(28 * 28 * 1), dim = c(1, 28, 28, 1))
options(repr.plot.width=2, repr.plot.height=2)
par(mar = c(0.1, 0.1, 0.1, 0.1))
input_img_data[1,,,] %>% range01 %>% as.raster %>% plot
# perform gradient ascent and progressively reconstruct the image the chosen kernel filter is most sensitive to
for (i in 1:40) {
c(loss_value, grads_value) %<-% iterate(list(input_img_data))
input_img_data <- input_img_data + grads_value
}
options(repr.plot.width=2, repr.plot.height=2)
par(mar = c(0.1, 0.1, 0.1, 0.1))
input_img_data[1,,,] %>% range01 %>% as.raster %>% plot
Looks like this filter is sensitive to vertical patterns!
# let's put everything together in a nice function
get_cnn_filters = function(layer_name, filters){
  layer_output = get_layer(model_cnn, layer_name)$output
  options(repr.plot.width = 10, repr.plot.height = 0.6 * filters / 16)
  par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow = c(filters / 16, 16))
  for(filter_index in seq_len(filters)){
    loss = K$mean(layer_output[,,,filter_index])
    grads = K$gradients(loss, model_cnn$input)[[1]]
    grads = grads / (K$sqrt(K$mean(K$square(grads))) + 1e-5)
    iterate = K$`function`(inputs = c(model_cnn$input), outputs = c(loss, grads))
    c(loss_value, grads_value) %<-% iterate(list(array(0, dim = c(1, 28, 28, 1))))
    input_img_data = array(runif(28 * 28 * 1), dim = c(1, 28, 28, 1))
    for (i in 1:40) {
      c(loss_value, grads_value) %<-% iterate(list(input_img_data))
      input_img_data = input_img_data + grads_value
    }
    input_img_data[1,,,] %>% range01 %>% as.raster %>% plot
  }
}
...and plot some stuff!
The first layer is sensitive to horizontal/vertical/diagonal patterns.
get_cnn_filters("conv2d_1", 32)
The second layer is still sensitive to the same types of patterns, yet localized in specific regions of the image. Some filters haven't learned anything (white noise).
get_cnn_filters("conv2d_2", 64)
The third and last convolutional layer starts to be sensitive to curved patterns. Again, some filters haven't learned anything (white noise).
get_cnn_filters("conv2d_3", 64)
All in all, it is possible to observe that the network progressively learns more complex patterns.
Have you ever seen those trippy images from Google's Deep Dream? Well, now you know how they're produced...
What about propagating a Jennifer Lawrence's pic through a CNN trained to discriminate dogs and cats?
What about this nice pet?
With small datasets, it is sometimes useful to generate more data by artificially manipulating the data available. See how we transform the digit 7 with the keras functions $\texttt{image\_data\_generator}$ and $\texttt{flow\_images\_from\_data}$.
batch = array_reshape(test_images[1,,,], c(1, 28, 28, 1))
batch_large = array_reshape(c(batch, batch, batch, batch), c(4, 28, 28, 1))
datagen = image_data_generator(rescale = 1,
                               rotation_range = 90,
                               width_shift_range = 0.2,
                               height_shift_range = 0.2,
                               shear_range = 0.2,
                               zoom_range = 0.2,
                               horizontal_flip = FALSE,
                               fill_mode = "nearest")
augmentation_generator <- flow_images_from_data(batch_large,
                                                generator = datagen,
                                                batch_size = 4)
distorted = generator_next(augmentation_generator)
options(repr.plot.width=5, repr.plot.height=1)
par(mar = c(0.1, 0.1, 0.1, 0.1), mfrow=c(1,4))
for(i in 1:4){
distorted[i,,,1] %>% as.raster %>% plot
}
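Such a generator could also feed the training loop directly. The snippet below is only a sketch: the hyper-parameters are illustrative, and large rotations are not necessarily sensible for digits (a rotated 6 starts to look like a 9).
# train on augmented batches drawn from the generator instead of the fixed training array
model_cnn %>% fit_generator(
  flow_images_from_data(train_images, train_labels,
                        generator = datagen, batch_size = 64),
  steps_per_epoch = 500,   # number of augmented batches per epoch (illustrative)
  epochs = 5
)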